library(dplyr)
library(ggplot2)
library(DT)
library(here)

5 Logistic Regression 3
5.1 Recap of Tutorial 4
library(rms)

Loading required package: Hmisc
Attaching package: 'Hmisc'
The following objects are masked from 'package:dplyr':
src, summarize
The following objects are masked from 'package:base':
format.pval, units
load(here("data","tutdata.RData"))
dd <- datadist(tutdata)
options(datadist = "dd")
fit1 <- lrm(y ~ blood.pressure + sex + age + rcs(cholesterol, 4), data = tutdata)
fit2 <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol, 4)), data = tutdata, x = TRUE, y = TRUE)
anova(fit2)

Wald Statistics Response: y
Factor Chi-Square d.f. P
blood.pressure 0.26 1 0.6105
sex (Factor+Higher Order Factors) 38.71 5 <.0001
All Interactions 26.13 4 <.0001
age (Factor+Higher Order Factors) 30.42 2 <.0001
All Interactions 3.86 1 0.0495
cholesterol (Factor+Higher Order Factors) 23.78 6 0.0006
All Interactions 22.47 3 0.0001
Nonlinear (Factor+Higher Order Factors) 5.33 4 0.2550
sex * age (Factor+Higher Order Factors) 3.86 1 0.0495
sex * cholesterol (Factor+Higher Order Factors) 22.47 3 0.0001
Nonlinear 4.79 2 0.0911
Nonlinear Interaction : f(A,B) vs. AB 4.79 2 0.0911
TOTAL NONLINEAR 5.33 4 0.2550
TOTAL INTERACTION 26.13 4 <.0001
TOTAL NONLINEAR + INTERACTION 26.81 6 0.0002
TOTAL 62.26 10 <.0001
We can also compare the two models using a likelihood ratio test (LRT). To use the LRT, one model must be nested within the larger model.
lrtest(fit1,fit2)
Model 1: y ~ blood.pressure + sex + age + rcs(cholesterol, 4)
Model 2: y ~ blood.pressure + sex * (age + rcs(cholesterol, 4))
L.R. Chisq d.f. P
2.831616e+01 4.000000e+00 1.076137e-05
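The reported p-value can be reproduced by hand: the LRT statistic is compared with a chi-squared distribution whose degrees of freedom equal the number of extra parameters in the larger model (here, the 4 interaction terms in fit2). A minimal sketch using the numbers above:

```r
# LRT by hand: compare the L.R. Chisq from lrtest() with a
# chi-squared distribution on 4 d.f. (the extra parameters in fit2).
lr_stat <- 28.31616   # L.R. Chisq reported above
df_diff <- 4          # extra parameters in fit2 relative to fit1
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
p_value               # approx. 1.08e-05, matching the lrtest() output
```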
5.2 Tutorial 5
set.seed(1017)
validate(fit2, B = 100)

index.orig training test optimism index.corrected n
Dxy 0.2832 0.3058 0.2682 0.0376 0.2456 100
R2 0.0896 0.1059 0.0795 0.0263 0.0632 100
Intercept 0.0000 0.0000 0.0389 -0.0389 0.0389 100
Slope 1.0000 1.0000 0.8601 0.1399 0.8601 100
Emax 0.0000 0.0000 0.0401 0.0401 0.0401 100
D 0.0684 0.0817 0.0604 0.0213 0.0470 100
U -0.0020 -0.0020 0.0019 -0.0039 0.0019 100
Q 0.0704 0.0837 0.0585 0.0252 0.0452 100
B 0.2316 0.2288 0.2342 -0.0054 0.2369 100
g 0.6230 0.6828 0.5790 0.1038 0.5192 100
gp 0.1457 0.1566 0.1357 0.0209 0.1248 100
- index.orig: the performance metric computed on the original (full) dataset.
- training: the metric computed on the bootstrap sample used to refit the model.
- test: the metric obtained when the bootstrap-refitted model is applied back to the original dataset.
- optimism: training minus test, an estimate of how much the apparent performance overstates true performance (overfitting).
- index.corrected: index.orig minus optimism, the optimism-adjusted estimate of model performance.
- n: number of successful bootstrap repetitions (here 100).
- Dxy: Somers' rank correlation between predicted probabilities and outcomes; related to the AUC by AUC = (Dxy + 1) / 2.
- g (g-index): Gini's mean difference of the linear predictor, a measure of predictive discrimination.
- Emax (maximum calibration error): largest absolute difference between predicted and calibrated probabilities.
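The optimism correction itself is just arithmetic on the rows of the table; for example, the Dxy row above:

```r
# Optimism-corrected Dxy, reproduced from the validate() output above.
index_orig <- 0.2832
training   <- 0.3058
test       <- 0.2682
optimism   <- training - test        # 0.0376
corrected  <- index_orig - optimism  # 0.2456
```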
dff <- resid(fit2, "dffits")
plot(dff)

show.influence(which.influence(fit2), data.frame(tutdata, dffits = dff), report = c("dffits"))

Count sex cholesterol dffits
143 2 female *274.0808 1.6108318
173 4 *female *138.5056 -1.1265158
181 2 *male 146.4978 0.8026558
408 2 *male 148.1649 0.6813596
411 1 female 146.3522 0.5594519
834 2 female *146.6413 -0.4908278
840 2 female *147.0025 -0.5604495
The DFFITS statistic measures the influence of each observation on the fitted model: the scaled change in that observation's predicted value if the observation were removed before fitting. In the output above, asterisks mark the predictor values of observations flagged as influential by which.influence.
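As a sketch of what DFFITS computes, the leave-one-out definition can be verified against stats::dffits() for an ordinary linear model (illustrative data, not the tutorial dataset):

```r
# DFFITS via explicit leave-one-out refitting, checked against stats::dffits().
set.seed(1)
d <- data.frame(x = 1:10)
d$y <- d$x + rnorm(10)
fit <- lm(y ~ x, data = d)
loo_dffits <- sapply(seq_len(nrow(d)), function(i) {
  fit_i <- lm(y ~ x, data = d[-i, ])   # refit without observation i
  s_i   <- summary(fit_i)$sigma        # leave-one-out residual SD
  h_i   <- hatvalues(fit)[i]           # leverage of observation i
  (predict(fit, d[i, ]) - predict(fit_i, d[i, ])) / (s_i * sqrt(h_i))
})
all.equal(unname(loo_dffits), unname(dffits(fit)))  # TRUE
```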
# Partial residuals are the residuals from the model, adjusted for the effect of the predictors that are not of primary interest
resid(fit2, "partial", pl = "loess")

- Non-linearity: if the loess curve shows a clear non-linear relationship between a predictor and the residuals, you might need to transform that predictor (e.g., using log or polynomial terms) to improve the model fit.
- Outliers or influential points: Any points that deviate significantly from the smoothed curve might be influential or outliers, warranting further investigation.
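The construction of partial residuals can be sketched with a plain glm(): they are the working residuals plus the (centered) contribution of the term of interest to the linear predictor. Illustrative data, not the tutorial dataset:

```r
# Partial residuals for one predictor in a logistic model:
# working residuals + that term's contribution to the linear predictor.
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.5 * x1 - x2))
fit <- glm(y ~ x1 + x2, family = binomial)
pres_x1 <- residuals(fit, type = "working") + predict(fit, type = "terms")[, "x1"]
# Agrees with R's built-in partial residuals:
all.equal(unname(pres_x1), unname(residuals(fit, type = "partial")[, "x1"]))
```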
5.3 Bootstrapping
Bootstrapping is a resampling technique used to estimate the sampling distribution of a statistic by repeatedly sampling with replacement from the original dataset. It is useful when the theoretical distribution is unknown or difficult to derive.
For example, to construct a confidence interval for a sample mean, we need to know (or approximate) the distribution of the sample mean:
set.seed(1017)
## pop data
n <- 1000
data <- rnorm(n, mean=5, sd=10)
analytical_lower <- mean(data) - 1.96*sd(data)/sqrt(n)
analytical_upper <- mean(data) + 1.96*sd(data)/sqrt(n)
nboot <- 1000
## sample(data, size=n, replace=TRUE) bootstrap sampling
## mean(sample(data, size=n, replace=TRUE)) calculate sample mean
boot_means <- replicate(nboot, mean(sample(data, size=n, replace=TRUE)))
boot_means %>% density() %>% plot()

# percentile
boot_lower_bound <- quantile(boot_means, 0.025)
boot_upper_bound <- quantile(boot_means, 0.975)
cat("Original Sample Mean:", mean(data), "\n")

Original Sample Mean: 4.87006

cat("95% CI for Mean (analytical solution):", analytical_lower, analytical_upper, "\n")

95% CI for Mean (analytical solution): 4.247337 5.492783

cat("95% Bootstrapped (nboot=1000) CI for Mean:", boot_lower_bound, boot_upper_bound, "\n")

95% Bootstrapped (nboot=1000) CI for Mean: 4.220525 5.464462
What would happen if we increased the number of bootstrap replicates?
nboot <- 10000
## sample(data, size=n, replace=TRUE) bootstrap sampling
## mean(sample(data, size=n, replace=TRUE)) calculate sample mean
boot_means2 <- replicate(nboot, mean(sample(data, size=n, replace=TRUE)))
# percentile
boot_lower_bound <- quantile(boot_means2, 0.025)
boot_upper_bound <- quantile(boot_means2, 0.975)
cat("95% CI for Mean (analytical solution):", analytical_lower, analytical_upper, "\n")

95% CI for Mean (analytical solution): 4.247337 5.492783

cat("95% Bootstrapped (nboot=10000) CI for Mean:", boot_lower_bound, boot_upper_bound, "\n")

95% Bootstrapped (nboot=10000) CI for Mean: 4.242179 5.499558